Using Proper Names to Cluster Documents

Authors

  • Dan Winchester
  • Mark Lee
Abstract

Proper names occur frequently in all types of natural language text. However, the treatment of proper names remains under-researched in Natural Language Processing. One particular problem is how to link information about the same entity when it is referred to by possibly different proper names across several documents. In this paper we describe a prototype system which first pre-processes individual documents using a simple name-conflation algorithm and then uses an adaptation of Schütze's context-group discrimination algorithm to cluster documents that are judged to contain references to the same named entity. We use this system to assess the potential utility of different contextual...
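
The two-stage pipeline described in the abstract (conflate names within each document, then cluster documents by the contexts in which a name appears) could be sketched roughly as follows. This is not the authors' implementation: the conflation rule (token containment), the bag-of-words context window, the cosine threshold, and the greedy single-pass clustering are all simplifying assumptions, and the clustering step is a stand-in for, not a reproduction of, Schütze's context-group discrimination. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def conflate_names(mentions):
    """Map each mention to the longest mention containing all of its tokens.

    Assumed rule for illustration only: "Lovelace" is conflated with
    "Ada Lovelace" because its tokens are a subset of the longer name's.
    """
    canonical = {}
    for mention in mentions:
        tokens = set(mention.split())
        candidates = [c for c in mentions if tokens <= set(c.split())]
        # Prefer the candidate with the most tokens; break ties alphabetically.
        canonical[mention] = max(candidates, key=lambda c: (len(c.split()), c))
    return canonical


def context_vector(text, name, window=5):
    """Count the words within `window` tokens of each occurrence of `name`."""
    tokens = re.findall(r"\w+", text.lower())
    name_tokens = name.lower().split()
    n = len(name_tokens)
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == name_tokens:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + n:i + n + window])
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(vectors, threshold=0.3):
    """Greedy single-pass grouping (an assumed stand-in for the real algorithm).

    Each document joins the first cluster whose first member is at least
    `threshold` similar to it; otherwise it starts a new cluster.
    """
    clusters = []
    for idx, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters
```

In this sketch, documents whose context vectors for the same conflated name land in one cluster would be treated as referring to the same entity; the threshold and window size are arbitrary and would need tuning on real data.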

Similar resources

Clinical Document Clustering using Multi-view Non-Negative Matrix Factorization

Clinical documents contain vital information such as symptom names, medication names, age, gender and some demographic information. This information can be used to give quick relief from a disease. An existing system clusters symptom names and medication names using Multi-View Non-Negative Matrix Factorization. While considering the clinical documents the facto...

Identification of related multilingual documents using ant clustering algorithms / Identificación de documentos multilingües relacionados mediante algoritmos de clustering de hormigas

This paper presents a document representation strategy and a bio-inspired algorithm to cluster multilingual collections of documents in the field of economics and business. The proposed approach allows the user to identify groups of related economics documents written in Spanish and English using techniques inspired by the clustering and sorting behaviours observed in some types of ants. In order t...

Proper name retrieval from diachronic documents for automatic speech transcription using lexical and temporal context

Proper names are usually key to understanding the information contained in a document. Our work focuses on increasing the vocabulary coverage of a speech transcription system by automatically retrieving new proper names from contemporary diachronic text documents. The idea is to use in-vocabulary proper names as an anchor to collect new linked proper names from the diachronic corpus. Our assump...

Textual Similarity based on Proper Names

Proper names represent about 10% of English or French newspaper articles. Their quantity and informational quality are already exploited in different Information Extraction systems. Proper names have been widely studied in the MUC conferences, which were designed to promote research in Information Extraction. We have created our own named entity extraction tool based on a linguistic description with automata. Th...

Ajout de nouveaux noms propres au vocabulaire d'un système de transcription en utilisant un corpus diachronique

Proper names are usually key to understanding the information contained in a document. Our work focuses on increasing the vocabulary size of a speech transcription system by automatically retrieving proper names from a contemporary diachronic text corpus. We assume that some proper names appear in documents relating to the same time period and in similar lexical contexts. We proposed methods that d...

Publication date: 2002